Multidimensional scaling methods can reconstruct genomic DNA loops using Hi-C data properties

This paper proposes multidimensional scaling (MDS) applied to high-throughput chromosome conformation capture (Hi-C) data on genomic interactions to visualize DNA loops. Currently, the mechanisms underlying the regulation of gene expression are poorly understood, and where and when DNA loops are formed remains undetermined. Previous studies have focused on reproducing the entire three-dimensional structure of chromatin; however, identifying DNA loops using these data is time-consuming and difficult. MDS is an unsupervised method for reconstructing the original coordinates from a distance matrix. Here, MDS was applied to high-throughput chromosome conformation capture (Hi-C) data on genomic interactions to visualize DNA loops. Hi-C data were converted to distances by taking the inverse to reproduce loops via MDS, and the missing values were set to zero. Using the converted data, MDS was applied to the log-transformed genomic coordinate distances and this process successfully reproduced the DNA loops in the given structure. Consequently, the reconstructed DNA loops revealed significantly more DNA-transcription factor interactions involved in DNA loop formation than those obtained from previously applied methods. Furthermore, the reconstructed DNA loops were significantly consistent with chromatin immunoprecipitation followed by sequencing (ChIP-seq) peak positions. In conclusion, the proposed method is an improvement over previous methods for identifying DNA loops.

[27]. ATAC-seq helps identify open chromatin regions and assess transcription factor occupancy. DNA loops have many arbitrary protein interventions to promote or regulate transcription, and three spatial spaciousness, so selected DNA loops and ChIP-seq and ATAC-seq should match the peaks in the data. Therefore, the DNA loop regions selected by the MDS-based method were rechecked to determine whether they matched the locations of ChIP-seq or ATAC-seq peaks. When a partial match was found between the position of the top 300 peaks and the DNA loop regions selected by the MDS-based method, the loop was counted. Fisher's exact probability test was then performed to determine whether the selected DNA loops significantly matched the peaks compared to randomly selected loops.' 2. How did the author find the logarithmic weight in Eq. (1)? What is the base of the logarithm? If the base is 5, the equations should make sense.

A.)
It is the base e of the natural logarithm. Indeed, 50,000 bp was used as a cut-off value, but it is unnecessary to set the base of the logarithm to 5. Hi-C data with distances between genomic coordinates of less than 50,000 bp were not changed because the values are large by default. The only important pre-processing was to reduce the gap by multiplying equation (A) by the Hi-C contact frequency, although the distances are far apart. This part needed to be explained better, so the following additions were made. Also, the word "log" has been corrected to "loge" to avoid misunderstandings. L110: 'Hi-C contact frequency between distant coordinates that form DNA loops are underestimated compared to Hi-C contact frequency between initially close coordinates [7]. The only important pre-processing step was to reduce the gap between Hi-C contact frequencies using the multiplying equation (1).' 3. The base of the logarithm of the Y-axis in Figs. 1 and 2 should be described.

A.)
In the title of Figs 2 and 3, it was corrected to 'natural logarithm' instead of 'logarithm'.
Fig2: Plot of the distance between coordinates versus the natural logarithm of Hi-C contact frequency before multiplying by weight Fig3: Plot of the distance between coordinates versus the natural logarithm of Hi-C contact frequency after multiplying by weight 4. The reviewer could not understand the data format in Supplementary Files. For example, the file "GSE141067(osteosarcoma)/0min/SVD_234/plot.chr1.0-4980.txt" has 5 columns and 2989 rows.

A.)
The data format has been improved. The data is a three-column data frame with SVD eigenvectors v2,v3,and v4. 5. l. 130 and Eq. (3): The author converted the contact frequency into the distance without any references.
Although the assumption is not theoretically supported in terms of polymer physics [Shinkai et al. NARGAB 2020], some computational methods have conventionally assumed [Serra et al. FEBS Lett 2015].

A.)
A citation is added for the adoption of the inverse of the interaction frequency. A polymer model such as PHi-C was not used because it is highly dependent on the performance of the Hi-C data.
L139: 'PHi-C' is a method to reproduce genome structure by considering a simple macromolecular model of the genome (a single chromosome) as a 'linked bead' [21]. PHi-C accurately estimates genome structure in four dimensions but requires dense Hi-C data. In this case, a versatile model was desired, so the reciprocal frequency of interactions between two particles in the model (i.e., between two bins or restriction fragments in the genome) was adopted as the distance. The inverse of the power of the frequency is debatable, but it was set to 1 as used in fractal globule models [22].  This implies a root in the DNA loop. In other words, it is an enhancer or promoter.
The description of the selection criteria for DNA loops was unclear and has been corrected as follows: L191: Formulaically, let T be the threshold, and Ei' be the distance between genomic coordinates, where i is the coordinates divided by the Hi-C resolution. Moreover, nk is the coordinate such that E'nk-E'nk-1 <0 and 0< E'nk+1-E'nk. Then, for any i, n'k is defined as the nk with nk <i< nk+1 that satisfies Ei'<T. Finally, the DNA loop was obtained from n'k-1 to n'k+1.

→
Formulaically, let T be the threshold, Ei' greater than the threshold T is denoted as E + i and is a DNA loop, where i is the coordinates divided by the Hi-C resolution. On the other hand, Ei' smaller than the threshold value is denoted as Ei. When proceeding from the both sides of DNA loop E + i to directions i+j and i-j (j is any natural number) that are smaller than the threshold value T, the root of the DNA loop is defined as Ei -Ei-j <0 and 0< E -i+j -E -I. 8. Eq. (7): The variable "V" seems to be a new one. What is the difference between "v" and "V"?

A.)
'V' is the unitary matrix of the singular value decomposition and 'v' is its column vector. Equation 7 has been modified because 'V' and 'v' have been mixed. A.) As MDS is reproducible, DNA loops can be uniquely determined, like singular value decomposition. However, it depends on how many parts of the Hi-C data are split. In this study, from the point of view of the computational complexity of the multidimensional scaling construction method, the Hi-C data were split into 1 to 5 pieces for analysis.
The description of the selection criteria for DNA loops was unclear and has been corrected as follows: L191: Formulaically, let T be the threshold, and Ei' be the distance between genomic coordinates, where i is the coordinates divided by the Hi-C resolution. Moreover, nk is the coordinate such that E'nk-E'nk-1 <0 and 0< E'nk+1-E'nk. Then, for any i, n'k is defined as the nk with nk <i< nk+1 that satisfies Ei'<T. Finally, the DNA loop was obtained from n'k-1 to n'k+1.
→ Formulaically, let T be the threshold, Ei' greater than the threshold T is denoted as Ei + and is a DNA loop, where i is the coordinates divided by the Hi-C resolution. On the other hand, Ei' smaller than the threshold value is denoted as Ei -. When proceeding from the both sides of DNA loop Ei + to directions i+j and i-j (j is any natural number) that are smaller than the threshold value T, the root of the DNA loop is defined as Ei --Ei-1 -< 0 and 0 < Ei+1 --Ei -.
The following flowchart of the algorithm has been added.  →Fig.6 Distance plot between coordinates of chromosome 5 50,000-100,000 kbp of series GSM6061774 by the MDS-based method.
11. l. 207: The author describes, "The number of DNA loops and number of genes are close to each other," referring to Table 1. However, the number of DNA loops is almost 30000, and the number of genes is almost 5000 in Table 1. These are NOT close to each other. A.) I wrote that 'The number of DNA loops and number of genes are close to each other', but this was a mistake. The number of DNA loops is almost the same for replicates, and the number of genes is almost the same for replicates.
The following has been corrected. A.) H3K27me3 is involved in heterochromatin but is also found to be abundant in DNA loops and represses gene expression. Therefore, I analyzed the H3K27me3 ChIP-seq data and found that the positions of the peaks in the H3K27me3ChIP-seq data and CTCF and H3K27ac significantly coincided with the positions of the selected DNA loops.
L219: 'ChIP-seq data for H3K27ac, H3K27me3, and CTCF were analyzed; H3K27ac is an active enhancer marker, CTCF is involved in DNA loops, H3K27me3 is involved in heterochromatin but is also found to be abundant in DNA loops and represses gene expression [28].' 13. Figure 7 is not attached to the manuscript.

A.)
Unnecessary latex codes were listed and misleading (Fig5). It has been deleted.
Reviewer #2: In this manuscript, Ryo utilizes multidimensional scaling (MDS) to capture DNA loops for highthroughput chromosome conformation capture (HiC) data. By transforming HiC data to a distance matrix, MDS identified DNA loops contained more DNA bound by transcription factors than existing methods. The manuscript is well organized.
I have 4 major concerns.
1. In section 2.2.2, the term "exaggerated representations of DNA loops" is confusing. Could you clarify the definition and rationale behind it? I am also concerned about the threshold of 50kbp for formula 1. Would the significant loops of length 50kbp be affected by this transformation? A.) "Exaggerated representations of DNA loops" was confusing. Therefore, instead of "exaggerated representations of DNA loops", the term "preprocessing to clearly represent DNA loops" was used.
'Clear representations of DNA loops' means that the Hi-C data is transformed using equation (1), and MDS is applied to reproduce a genome structure that shows DNA loops more clearly than if no pre-processing is applied to the Hi-C data.
In equation (1), 50 kb was set as the cut-off value. No extra processing was added to the Hi-C contact frequencies within 50 kb of the genomic co-ordinates, as these are large by default; for Hi-C contact frequencies above 50 kb, the gap in the Hi-C data was filled by multiplying by the natural logarithm of the distance. For example, changing the cut-off value of 50 kb to 70 kb does not significantly change the structure of the genome. However, setting the cut-off to 20 kb or less will change the genome structure because the Hi-C contact frequency below 20 kb is smaller because of the natural logarithm. The present study found that setting the cut-off value to 50 kb reproduces the genome structure, which clearly represents the DNA loop, although there is no strict biological basis for this. L121: For Hi-C contact frequencies above 50 kbp, the gap in the Hi-C data was filled by multiplying by the natural logarithm of the distance. For example, changing the cut-off value of 50 kbp to 70 kbp does not significantly change the structure of the genome. However, setting the cut-off to 20 kbp or less will alter the genome structure because a Hi-C contact frequency below 20 kbp will be low because of the natural logarithm.
The present study found that setting the cut-off value of 50 kbp reproduces a genome structure that clearly represents the DNA loop, although there is no strict biological basis for this.
2. In section 2.2.3, the authors claimed that the second and the third eigenvectors often retain the original structure. Empirically, how would one decide for real HiC data which eigenvectors to use for identifying loops? Why is the first eigenvector excluded? A.) Empirically, the second and third eigenvectors v2, v3 after eigenvalue decomposition frequently retain their original structure. This was verified in my previous paper with a simple simulation. Therefore, after eigenvalue decomposition, v2 and v3 of the eigenvector V were chosen as the genomic structure. In the following, two simple simulations were performed.
The column of the orthogonal matrix U presenting the desired structure is unclear. However, in many cases, the second and third columns of the orthogonal matrix U are thought to present the considering structure. the first one is 500 random numbers ranging from -100 to 100 and the second one is a circle composed of random numbers, such as x i1 = cos (2π i N ) + g(λ), x i2 = sin (2π i N ) + g(λ). Here, the function g is the error function and is defined as follows: g(λ) = λ , 0 ≤ λ ≤ 0.5 The black dots are the original coordinates, and the red are the v2 and v3 plots from MDS. As can be seen, by this simple simulation, it is empirically found that the second and third eigenvectors represent the original coordinates.
Therefore, the second and third eigenvector after SVD was also adopted as the desired structure in this study.
3. In formula 8, the distance matrix is smoothed with a window size of 100kbp. Has the author evaluated the performance of MDS varying different window sizes? For other HiC data, how would the user recommend selecting this hyperparameter? A.) Smoothing is debatable. For example, 40kbp of smoothing worked well, and without smoothing, the loop could not be fully selected. Future research is needed to find a biologically valid smoothing. In this study, smoothing was performed at 10kbp, which is neither too long nor too short. Further investigation is needed when dealing with other resolutions.
L183: 'This smoothing was performed at 100 kbp, which is neither too long nor too short. Further investigation is needed when dealing with other resolutions.' 4. In table 1, is 318380 the number of all potential DNA loops considered? What do the ratios in two columns represent? # of loops seems much bigger than # of genes in table 1, which contradicts the statement in lines 207-210, "the number of loops and number of genes are close to each other?" Could you quantify this? A.) Fisher's exact probability test was performed to determine whether the enhancer-promoter regions in this study were significantly consistent with selected DNA loop regions compared to the randomly selected one. The following representation is misleading and has been revised.
L216: In the present study, Fisher's exact probability test was performed to determine whether the enhancerpromoter regions in this study were significantly consistent with random selections.
→Fisher's exact probability test was then performed to determine whether the selected DNA loops significantly matched the peaks compared to randomly selected loops.
In GSM6061774, 318,380,000 bp were selected as loop regions, of which 2,410,018 bp were partially matched to peaks of the ChIP-seq data. The expected value was determined by multiplying the proportion of the top 300 peak regions (peak regions divided by total genomic regions) by the length of the loop region. The p-value for Fisher's exact probability test was 2.2×10 -16 , and the odds ratio was 39.33, indicating significant agreement.
L343: In GSM6061774, 318,380,000 bp were selected as loop regions, of which 2,410,018 bp were partially matched to peaks of the ChIP-seq data. The expected value was determined by multiplying the proportion of the top 300 peak regions (peak regions divided by total genomic regions) by the length of the loop region. The pvalue for Fisher's exact probability test was 2.2 × 10 -16 , and the odds ratio was 39.33, indicating significant agreement. Other data were also in significant agreement (Supplementary Files).
I wrote that 'The number of DNA loops and number of genes are close to each other', but this was a mistake. The number of DNA loops is almost the same for replicates, and the number of genes is almost the same for replicates.
The following has been corrected. The description of the selection criteria for DNA loops was unclear and has been corrected as follows: L191: Formulaically, let T be the threshold, and Ei' be the distance between genomic coordinates, where i is the coordinates divided by the Hi-C resolution. Moreover, nk is the coordinate such that E'nk-E'nk-1 <0 and 0< E'nk+1-E'nk. Then, for any i, n'k is defined as the nk with nk <i< nk+1 that satisfies Ei'<T. Finally, the DNA loop was obtained from n'k-1 to n'k+1.
→ Formulaically, let T be the threshold, Ei' greater than the threshold T is denoted as Ei + and is a DNA loop, where i is the coordinates divided by the Hi-C resolution. On the other hand, Ei' smaller than the threshold value is denoted as Ei -. When proceeding from the both sides of DNA loop Ei + to directions i+j and i-j (j is any natural number) that are smaller than the threshold value T, the root of the DNA loop is defined as Ei --Ei-1 -< 0 and 0 < Ei+1 --Ei -.
2) In the Results section, the presentation of predicted loops in each dataset is too short to evaluate its merit. In Particular, the results have to be compared with other similar tools for predicting DNA loops by using Hi-C sequencing datasets. And also, I am not clear about how other genomic information (e.g., enhancer or promoter histone markers and nucleosome density) at these predicted loops, which are important markers for validating the predicted DNA loops/interactions, are functional or random interactions.

A.)
This manuscript compares this method with my old method and miniMDS. In the results section, I mentioned that the present method gave better enrichment analysis results and more genes were included in the selected DNA loops. I tried another method, StoHi-C, which clearly outputs undesirable structures (supplemental files).
All chromatin structures are shown in the supplemental file. 3) In the Discussions section, the author needs to improve this part significantly to convince people that the proposed new method is useful for studying DNA loops based on Hi-C data. A.) The discussion section has been greatly improved to convince the reader.
L376: 'One of the main goals of this study was to identify a method of reconstructing the genomic structure that differentiates DNA with and without loops. As indicated in the results, the validity of the proposed method is significantly consistent. To my knowledge, no existing study has been able to confirm the consistency of the DNA loop to this extent. Therefore, since the consistency of (i)-(iii) was confirmed in a data-driven manner, this method is useful for reproducing DNA loops.' 4) Finally, the quality of all main figures is not good, which has to be improved significantly. A.) All figures have been improved.